The Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they carry: annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze customer data, identify which customers are likely to leave the service, and understand why, so that it can improve in those areas.
As a data scientist at Thera bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn import metrics
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
#Read CSV
df = pd.read_csv("BankChurners.csv")
#Remove duplicate index, CLIENTNUM
df.drop("CLIENTNUM", inplace=True, axis=1)
print(df.shape)
df.head()
We have a pretty sizeable dataset here, with 10127 total rows and 20 columns (after removing CLIENTNUM). I can already tell some of them are going to need some pre-processing. Education_Level and Income_Category are currently categorical but can be made pseudo-numerical; I want to put numbers there to show how much education/income each row has relative to the rest of the dataset.
#Check for any immediate problems with data
df.info()
At least two columns have missing data: Education_Level and Marital_Status. We should also check the columns labeled "object", since those might have missing data in the form of "null strings". The rest seem safe: the data types are int or float and the non-nulls equal the number of total rows.
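One quick way to surface such "null strings" is to print the unique values of every object column. A minimal sketch on a toy frame (made-up values, not the real file) of the kind of check meant here:

```python
import pandas as pd

# Toy frame with a placeholder value hiding in an object column
toy = pd.DataFrame({
    "Income_Category": ["$40K - $60K", "abc", "Less than $40K"],
    "Gender": ["M", "F", "F"],
})

# isna() sees nothing, but the unique values expose the "abc" placeholder
print(toy.isna().sum().sum())  # 0 -- no NaN to be found
for col in toy.select_dtypes(include="object"):
    print(col, toy[col].unique())
```

Any placeholder found this way ("abc", "Unknown", empty strings) can then be handled explicitly during pre-processing.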
df.describe().T
#Separate columns into categorical and numerical lists for EDA
cat_columns = ['Attrition_Flag', 'Customer_Age', 'Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category']
num_columns = [col for col in df.columns if col not in cat_columns]
for col in cat_columns:
    sns.histplot(df[col])
    plt.xticks(rotation=30)
    plt.show()
df['Income_Category'].value_counts()
Attrition_Flag: This is the target variable, and we can see that it's imbalanced. We're trying to make a model that will predict customer attrition, but as things are a model that is trying to be as accurate as possible will be biased toward predicting the majority class. We will try some techniques later to resolve this.
Customer Age: Some spikes at the beginning and end suggesting some data is truncated, but these already represent the extremes so I don't see a problem here. Looks like a standard age distribution.
Gender: No surprises here.
Dependent_count: Again, looks fine.
Education_Level: These levels are not in order. I'd like to turn this category into "pseudo-numerical" with integers 0-5 so that they can be arranged in order of increasing education. I'm also a little surprised at the dominance of the "Graduate" category.
Marital_Status: Seems fine to me.
Income_Category: Like Education_Level, I'd like this to be arranged in order of increasing income. I can perhaps replace each value with the median of its limits (e.g. "60K - 80K" --> 70). We also have the "abc" values which are just missing.
Card_Category: Almost all customers are just using the Blue card. I expect this to not matter for most, but maybe the gold/silver/platinum will have different attrition rates.
In the end, we'll one hot encode all the categorical data so we are working with pure numbers.
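The imbalance noted for Attrition_Flag above can be quantified with value_counts(normalize=True). A toy sketch with a made-up 85/15 split (not the actual dataset counts):

```python
import pandas as pd

# Toy labels, illustrating the proportion check only
flags = pd.Series(["Existing Customer"] * 85 + ["Attrited Customer"] * 15)
props = flags.value_counts(normalize=True)
print(props)  # Existing Customer 0.85, Attrited Customer 0.15
```

These proportions are also what justifies the resampling experiments later on.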
#Fix the categorical data as mentioned above.
df['flagged'] = df['Attrition_Flag'] == "Attrited Customer"
df['flagged'] = df['flagged'].astype(int)
df = df.drop("Attrition_Flag", axis=1)
#Replace Education_Level with tiered categories. Missing values should be replaced by mode (Graduate).
educations = {
"Uneducated": 0, "High School": 1, "College": 2,
"Graduate": 3, "Post-Graduate": 4, "Doctorate": 5
}
df['Education_Level'] = df['Education_Level'].fillna(3).replace(educations)
sns.histplot(df['Education_Level'])
#Replace Income_Category with approximate average income for the category.
#At $120K+ it doesn't really matter how high it is as long as the model knows it's higher than the rest.
incomes = {
"Less than $40K": 30,
"$40K - $60K": 50,
"$60K - $80K": 70,
"$80K - $120K": 100,
"$120K +": 140
}
df["Income_Category"] = df["Income_Category"].replace(incomes)
#Regarding the "abc" values, I will replace them with the mode (30) since that's much more represented than the others.
df['Income_Category'] = df['Income_Category'].replace("abc", 30)
sns.histplot(df['Income_Category'])
"""
For Marital Status I'm not going to impute missing values. There is a mode (Married), but Single is nearly as
common. I'm going to do one hot encoding and just leave all the columns as 0 for the ones that are missing. The
models will just not have a 1 to work with on those rows.
"""
one_hot = pd.get_dummies(df['Marital_Status'])
df = df.drop("Marital_Status", axis=1)
df = df.join(one_hot)
df = df.join(pd.get_dummies(df["Gender"])).drop("Gender", axis=1)
df = df.join(pd.get_dummies(df["Card_Category"])).drop("Card_Category", axis=1)
df
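The decision to leave missing Marital_Status rows as all zeros works because pd.get_dummies skips NaN by default. A toy check (not the bank column):

```python
import pandas as pd

# The None row gets no 1 in any dummy column
s = pd.Series(["Married", None, "Single", "Divorced"])
dummies = pd.get_dummies(s)
print(dummies.sum(axis=1).tolist())  # [1, 0, 1, 1] -- row 1 is all zeros
```

If we ever wanted an explicit "missing" indicator instead, get_dummies accepts dummy_na=True.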
Now it's time to investigate the numerical variables. We can use "flagged" as a hue to get a hint about their relationship to the target variable.
for col in num_columns:
    sns.boxplot(data=df, x="flagged", y=col)
    plt.xlabel("flagged")
    plt.ylabel(col)
    plt.show()
Dependent_count: A slight difference, with flagged customers having slightly more dependents on average.
Months_on_book: Almost no difference.
Total_Relationship_Count: Flagged customers hold slightly fewer products on average.
Months_Inactive_12_mon: Unsurprising that flagged customers have more inactive months on average.
Contacts_Count_12_mon: Flagged customers do have more contacts with the bank on average.
Credit_Limit: Little difference.
Total_Revolving_Bal: Pretty significant difference; flagged customers have much less revolving balance.
Avg_Open_To_Buy: Nothing.
Total_Trans_Amt: Pretty similar except high outliers aren't represented amongst flagged customers.
Total_Trans_Ct: One of the better predictors so far. Flagged customers have far fewer transactions.
Total_Ct_Chng_Q4_Q1: Slight difference. Lower on average for flagged customers.
Avg_Utilization_Ratio: Pretty good test as well. Most flagged customers have close to zero utilization, with many at exactly zero.
Finally, we'll do a quick overview of relationships between features. A pair-plot is going to be really large, but we might be able to glean some insight about the relationships between some of these features.
sns.pairplot(df)
A lot of the categorical variables are nearly perfectly uncorrelated with each other, but the numerical variables do have some connection. Specifically, "Credit Limit" and "Avg_Open_To_Buy" are nearly perfectly correlated. We might do well to remove one of those because they're effectively duplicates. This makes sense to me because Avg_Open_to_Buy is just the amount of the total Credit Limit that is still available to be used, which should be redundant when combined with Total Revolving Balance.
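Rather than eyeballing the pair-plot, a correlation-matrix scan can flag redundant pairs directly. A minimal sketch with synthetic numbers (not the bank data), where Avg_Open_To_Buy is constructed as Credit_Limit minus Total_Revolving_Bal and is therefore definitionally redundant:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the three related columns
rng = np.random.default_rng(1)
limit = rng.uniform(1_000, 30_000, 200)
revolving = rng.uniform(0, 2_000, 200)
toy = pd.DataFrame({
    "Credit_Limit": limit,
    "Total_Revolving_Bal": revolving,
    "Avg_Open_To_Buy": limit - revolving,  # derived, hence redundant
})

# Flag feature pairs whose absolute correlation exceeds 0.95
corr = toy.corr().abs()
high = [(a, b, round(corr.loc[a, b], 3))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > 0.95]
print(high)  # only the Credit_Limit / Avg_Open_To_Buy pair survives
```

On the real frame the same scan would single out Credit_Limit and Avg_Open_To_Buy, supporting the drop below.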
df = df.drop("Avg_Open_To_Buy",axis=1)
The other relationships seem more natural. Credit Limit and Utilization Ratio are negatively correlated. Transaction Count and Transaction Amount are positively correlated (duh), but not nearly as cleanly as you might expect. The amount per transaction might vary a bit. I wonder if it's worth putting in a new column for that ratio.
df['amount_per_transaction'] = df["Total_Trans_Amt"] / df["Total_Trans_Ct"]
sns.boxplot(x=df['flagged'], y=df['amount_per_transaction'])
This is actually kind of curious, because Transaction Amount and Transaction Count were each different for Existing vs. Attrited customers, yet the ratio of amount/count is close to the same either way.
Questions:
- Right-skewed, but especially so for Existing Customers.
- Most are at the Graduate level, with some not reaching that level. Post-Graduate education is somewhat rare.
- Most of the customers are relatively poor, with <$40k being the most common category by far.
- How does the change in transaction count (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)? While the majority of customers are below 1.0 (meaning higher in Q4), a ratio greater than 1.0 was much more likely to be associated with a non-Attrited customer.
- How does the number of inactive months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)? Surprisingly, there was a lot of overlap in this category between the two types of customers. They both had similar numbers of inactive months, but the spread of the distribution was greater for Existing customers (not surprising, since there are more of them).
#Designate target variable as Y, the rest as X.
y = df['flagged']
X = df.drop("flagged", axis=1)
#Split the data into three parts: training, validation, and testing.
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
The nature of predictions made by the classification model will translate as follows:
- A false negative means we predicted a customer would stay, but they attrited: the bank loses the customer and the associated fee income, which is expensive.
- A false positive means we predicted attrition for a customer who would have stayed: the bank spends a retention incentive unnecessarily, which is comparatively cheap.
Which metric to optimize?
Since false negatives are the expensive error, we will optimize Recall: maximizing it minimizes the number of attriting customers we fail to flag.
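To see why plain accuracy is the wrong target here, consider a hypothetical model that never predicts attrition on an imbalanced 85/15 split (toy numbers, not our data):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 15% attrited (1), 85% existing (0)
y_true = [1] * 15 + [0] * 85
y_naive = [0] * 100  # predicts "no attrition" for everyone

print(accuracy_score(y_true, y_naive))  # 0.85 -- looks fine
print(recall_score(y_true, y_naive))    # 0.0 -- misses every attrited customer
```

A high-accuracy model can still be useless for the bank; recall directly penalizes the missed attriters.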
Let's define a function that outputs different metrics (including recall) for a fitted model, so that we do not have to repeat the same code while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
Sample code for model building with original data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
We can see that some models tend toward overfitting more than others. Bagging, Random Forest, and Decision Tree all perform much better in training than in validation. GBM and AdaBoost are not overfit, which is a great sign.
We can potentially do better by rebalancing the target data. We will try oversampling and undersampling.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before Over Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Over Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Over Sampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Over Sampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Over Sampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Over Sampling, the shape of train_y: {} \n".format(y_train_over.shape))
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
While performance is better overall on the training data, mostly what this has done is cause more overfitting. Our recall scores are up compared to before, but we can tell that even our GBM and AdaBoost models are overfit. It's not too bad, though; I would still take 91% recall from GBM compared to the 85% it was before.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
These are the best results we've seen so far: up to 95% recall on validation, and no overfitting on GBM/AdaBoost (the others are still overfit, but for the trees that's likely because we haven't pruned them - this can happen in the next step).
# defining model
model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV, on undersampled training data
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
#Check performance of this model on the validation set
model = RandomForestClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_un,y_train_un)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
# defining model
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
#Check performance of this model on the validation set
model = GradientBoostingClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_un,y_train_un)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
#Check performance of this model on the validation set
model = AdaBoostClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_over,y_train_over)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
Of the ones tuned, the best performance belongs to the Gradient Boosting Classifier on undersampled data, with a 95% recall score on the validation set (training was 94%, so not overfit either). Therefore this is the model I will choose.
Let's see how it performs against the heretofore unseen test data!
params = {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05}
finalmodel = GradientBoostingClassifier(**params, random_state=1)
finalmodel.fit(X_train_un, y_train_un)
y_pred = finalmodel.predict(X_test)
print(f"Test Set Recall: {recall_score(y_test, y_pred)}")
Looks pretty good! Now let's break down the model for insights.
model_performance_classification_sklearn(finalmodel, X_test, y_test)
confusion_matrix(y_test, y_pred)
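ConfusionMatrixDisplay was imported at the top but never used; a small sketch (toy labels, not the actual test-set predictions) of how it can render the matrix graphically instead of as a raw array:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Toy labels, illustrating the plotting API only
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true_toy, y_pred_toy)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["Existing", "Attrited"])
disp.plot()
plt.close("all")
```

On the real results, the labeled cells make the 313 true positives and 122 false positives easier to read at a glance.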
We correctly identified 313 out of 325 (96%) of Attrited customers. There were 122 false positives, but in this business application they carry relatively little cost.
When a customer is flagged as a possible attrition candidate, the business's response should be to target that individual with incentives to remain a customer. This could be special deals offered to them, perhaps, or some other form of specialized treatment. Even if we flag a customer as potential attrition, it's not a heavy cost to offer the same treatment. However, not offering these deals and losing a customer as a result can be expensive. This is why we prioritized recall as our scoring metric.
Finally, let's look at which features ended up being the most determinant for the predictions.
feature_names = X_train.columns
importances = finalmodel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
This case is good evidence that simpler explanations are better. If you're worried about losing a customer, you can look at a whole bunch of attributes like marital status, gender, and education, but next to actual usage data it's clear they aren't as important. The most important features were simply how many transactions the customer made over the last 12 months, and for how much. Those with fewer transactions were more at risk of becoming Attrited -- and of course that's the case.